Most of people care about the price of residential properties. Washington D.C. is the capital of the US. It is interesting to take a view at the price of DC residential properties. Intuitively, DC is a city that has faced one of the highest price level in the US. Generally, in 2017, the population in DC was nearly 700,000 and the average price of a single family was approximately 649,000. In this report, we aim to find what factors influence the price significantly.
Link: https://www.kaggle.com/christophercorrea/dc-residential-properties
The main source of our data is a csv file in kaggle, which is the 7th version updated in July 2018.
Link: https://data.worldbank.org/indicator/fp.cpi.totl.zg
Another dataset about inflation is a csv file in World Bank Open Data, which is updated in January 2019.
This dataset has 49 variables including price. Apart from price, these variables can be divided into four groups: attributes of house itself, time related factors, assessment of third party, and location. For the full name of these variables, these infomation is introduced in detail in https://www.kaggle.com/christophercorrea/dc-residential-properties.
#str(rawrp)
names(rawrp)
## [1] "X.1" "BATHRM" "HF_BATHRM"
## [4] "HEAT" "AC" "NUM_UNITS"
## [7] "ROOMS" "BEDRM" "AYB"
## [10] "YR_RMDL" "EYB" "STORIES"
## [13] "SALEDATE" "PRICE" "QUALIFIED"
## [16] "SALE_NUM" "GBA" "BLDG_NUM"
## [19] "STYLE" "STRUCT" "GRADE"
## [22] "CNDTN" "EXTWALL" "ROOF"
## [25] "INTWALL" "KITCHENS" "FIREPLACES"
## [28] "USECODE" "LANDAREA" "GIS_LAST_MOD_DTTM"
## [31] "SOURCE" "CMPLX_NUM" "LIVING_GBA"
## [34] "FULLADDRESS" "CITY" "STATE"
## [37] "ZIPCODE" "NATIONALGRID" "LATITUDE"
## [40] "LONGITUDE" "ASSESSMENT_NBHD" "ASSESSMENT_SUBNBHD"
## [43] "CENSUS_TRACT" "CENSUS_BLOCK" "WARD"
## [46] "SQUARE" "X" "Y"
## [49] "QUADRANT"
For the time part, we convert the date format (dd/mm/yyyy) into year format (yyyy) and then calculate the time span. For example, the time span of the building (SPAN_AYB) is from the year the structure was first built to the last date of sale. Similar for the time span of the year remodeled (SPAN_YR_RMDL) and the time span of an improvement was built (SPAN_EYB). Then, we can use these time spans numerically. We also set a new variable that is price per square foot, which is a common-use measurement.
There are so many redundant variables, such as latitude & longitude and X & Y. In this report, we ignore nearly half of the variables here.
prerp <- rawrp[-c(1,18,28,30:36,38:40,44,46:48)]
prerp$SALE_YR <- as.integer(format(as.Date(prerp$SALEDATE),'%Y'))
prerp$SPAN_YR_RMDL <- prerp$SALE_YR - prerp$YR_RMDL
prerp$SPAN_AYB <- prerp$SALE_YR - prerp$AYB
prerp$SPAN_EYB <- prerp$SALE_YR - prerp$EYB
#names(prerp)
prerp <- prerp[-c(8:10,12)]
prerp$BATHRM <- prerp$BATHRM + 0.5*prerp$HF_BATHRM
prerp <- prerp[-c(2)]
prerp$SALE_NUM <- as.factor(prerp$SALE_NUM)
prerp$PRICE_GBA <- prerp$PRICE/prerp$GBA
names(prerp)
## [1] "BATHRM" "HEAT" "AC"
## [4] "NUM_UNITS" "ROOMS" "BEDRM"
## [7] "STORIES" "PRICE" "QUALIFIED"
## [10] "SALE_NUM" "GBA" "STYLE"
## [13] "STRUCT" "GRADE" "CNDTN"
## [16] "EXTWALL" "ROOF" "INTWALL"
## [19] "KITCHENS" "FIREPLACES" "LANDAREA"
## [22] "ZIPCODE" "ASSESSMENT_NBHD" "ASSESSMENT_SUBNBHD"
## [25] "CENSUS_TRACT" "WARD" "QUADRANT"
## [28] "SALE_YR" "SPAN_YR_RMDL" "SPAN_AYB"
## [31] "SPAN_EYB" "PRICE_GBA"
Following, we only select the complete data to use. One thing should be noted, the factor variable air conditioner (AC) has three levels: 0, N, Y. Hereby, “0” means the data is not available, which is the same as NA. As a result, we narrow the dataset down from 150k rows to 33k rows.
nrow(prerp)
## [1] 158957
levels(prerp[,3])[1] <- NA #AC = "0" is NA
prerp <- na.omit(prerp)
#prerp <- prerp[!is.na(prerp$PRICE_GBA),]
nrow(prerp)
## [1] 33162
We have finished the basic cleaning, but it is still a problem to see some extreme outliers. If we trivially use the outlierKD function w.r.t price to remove the outliers, we will find the upperbound of the housing price is only 1,000,000 USD. From the common knowledge, we know the average housing price in DC is over 600k USD and there are so many properties sold over 1,000,000 USD. Additonally, from the plot of price or price per square foot (scrolling down to see), these data is highly skewed to the right. Thus, using outlierKD function may remove too many useful data on the right and too little on the left.
Therefore we use this following method to remove more outliers on the left than the right, with 5% and 0.1% respectively. Although this approach is not very accurate and well-proved, it is still reasonable comparing to outlierKD function.
p_lowerbound <- sort(prerp$PRICE_GBA)[nrow(prerp)*0.05] # impossible PRICE_GBA <= 80
p_upperbound <- sort(prerp$PRICE_GBA)[nrow(prerp)*0.999] # still possible >= 1000
r_lowerbound <- sort(prerp$ROOMS)[nrow(prerp)*0.001]
r_upperbound <- sort(prerp$ROOMS)[nrow(prerp)*0.999]
rp <- subset(prerp, PRICE_GBA >= p_lowerbound & PRICE_GBA <= p_upperbound & ROOMS >= r_lowerbound & ROOMS <= r_upperbound & KITCHENS <= 4)
rp_num <- data.frame(sapply(rp, as.numeric)) # for correlation part
There are the extreme cases if we do not clean the outliers. One impossible instance is price is only 1 USD with 7 rooms. Besides, we also do not care about the luxury properties in this report.
price_extreme <- subset(prerp, PRICE_GBA <= p_lowerbound | PRICE_GBA >= p_upperbound)
subset(price_extreme, PRICE_GBA <= 1 | PRICE_GBA >= 3000) #very extreme cases
## BATHRM HEAT AC NUM_UNITS ROOMS BEDRM STORIES PRICE
## 3331 2.5 Hot Water Rad N 1 9 4 2 936
## 4200 1.5 Hot Water Rad Y 1 6 3 2 1377
## 18569 2.5 Hot Water Rad N 2 8 3 2 10
## 28637 2.5 Forced Air Y 1 5 0 1 3675000
## 42070 3.5 Forced Air Y 3 8 3 3 250
## 48415 2.0 Forced Air N 1 8 5 3 250
## 49322 2.0 Forced Air Y 1 9 6 3 250
## 54230 4.0 Forced Air Y 2 8 4 3 500
## 54910 6.0 Forced Air Y 2 10 6 3 250
## 55049 3.0 Hot Water Rad Y 1 9 5 2 250
## 81695 1.5 Hot Water Rad N 1 7 3 2 1
## 84166 2.0 Hot Water Rad Y 2 8 2 2 250
## 90982 4.0 Forced Air Y 4 16 4 2 11000000
## 90983 4.0 Forced Air Y 4 16 4 2 11000000
## 100486 4.0 Hot Water Rad Y 4 16 4 2 1000
## 102783 1.0 Air Exchng N 1 5 2 2 250
## 106684 3.0 Forced Air N 3 14 6 3 1
## QUALIFIED SALE_NUM GBA STYLE STRUCT GRADE
## 3331 U 2 2142 2 Story Row Inside Good Quality
## 4200 U 1 1424 2 Story Row Inside Good Quality
## 18569 U 2 2272 2 Story Multi Superior
## 28637 Q 2 994 1 Story Single Superior
## 42070 U 1 3179 3 Story Row Inside Above Average
## 48415 Q 1 2281 3 Story Row Inside Good Quality
## 49322 U 1 2990 3 Story Row Inside Good Quality
## 54230 U 1 2421 3 Story Row Inside Good Quality
## 54910 U 1 2596 3 Story Row Inside Good Quality
## 55049 U 1 1462 2 Story Row End Good Quality
## 81695 U 1 1420 2 Story Single Above Average
## 84166 U 1 1298 2 Story Row Inside Average
## 90982 U 2 3440 2 Story Multi Average
## 90983 U 2 3400 2 Story Multi Average
## 100486 U 1 2736 2 Story Multi Average
## 102783 U 1 1132 2 Story Semi-Detached Average
## 106684 U 1 2150 3 Story Multi Average
## CNDTN EXTWALL ROOF INTWALL KITCHENS
## 3331 Average Common Brick Metal- Sms Hardwood 1
## 4200 Average Common Brick Built Up Hardwood 1
## 18569 Average Common Brick Built Up Hardwood 2
## 28637 Very Good Stucco Slate Hardwood 1
## 42070 Good Common Brick Metal- Sms Hardwood 3
## 48415 Average Common Brick Metal- Sms Wood Floor 1
## 49322 Average Common Brick Metal- Sms Hardwood 1
## 54230 Average Common Brick Built Up Hardwood 2
## 54910 Average Common Brick Built Up Hardwood/Carp 2
## 55049 Good Common Brick Built Up Hardwood 1
## 81695 Average Common Brick Comp Shingle Wood Floor 1
## 84166 Average Common Brick Metal- Sms Hardwood 2
## 90982 Good Common Brick Built Up Carpet 4
## 90983 Good Common Brick Built Up Carpet 4
## 100486 Average Common Brick Metal- Sms Wood Floor 4
## 102783 Average Brick Veneer Metal- Sms Wood Floor 1
## 106684 Average Common Brick Built Up Hardwood 3
## FIREPLACES LANDAREA ZIPCODE ASSESSMENT_NBHD ASSESSMENT_SUBNBHD
## 3331 3 1398 20001 Old City 2 040 B Old City 2
## 4200 5 1360 20009 Old City 2 040 E Old City 2
## 18569 0 2400 20007 Georgetown 025 F Georgetown
## 28637 0 13940 20016 Wesley Heights 054 A Wesley Heights
## 42070 3 2700 20009 Columbia Heights 015 A Columbia Heights
## 48415 0 2417 20010 Columbia Heights 015 E Columbia Heights
## 49322 0 2853 20009 Columbia Heights 015 E Columbia Heights
## 54230 0 1170 20001 Ledroit Park 031 A Ledroit Park
## 54910 0 2520 20001 Ledroit Park 031 B Ledroit Park
## 55049 1 1152 20001 Ledroit Park 031 B Ledroit Park
## 81695 1 4400 20018 Michigan Park
## 84166 0 1040 20002 Old City 1 039 G Old City 1
## 90982 0 4116 20019 Fort Dupont Park 022 A Fort Dupont Park
## 90983 0 3504 20019 Fort Dupont Park 022 A Fort Dupont Park
## 100486 0 4224 20020 Anacostia 002 A Anacostia
## 102783 0 2216 20020 Anacostia 002 A Anacostia
## 106684 0 7811 20032 Congress Heights 016 A Congress Heights
## CENSUS_TRACT WARD QUADRANT SALE_YR SPAN_YR_RMDL SPAN_AYB SPAN_EYB
## 3331 4801 Ward 6 NW 1996 0 96 27
## 4200 4400 Ward 1 NW 2000 -4 100 40
## 18569 100 Ward 2 NW 2018 9 128 42
## 28637 801 Ward 3 NW 2016 13 10 4
## 42070 3700 Ward 1 NW 1999 -5 94 32
## 48415 3000 Ward 1 NW 1999 11 91 39
## 49322 3600 Ward 1 NW 1999 -2 89 39
## 54230 3400 Ward 1 NW 1999 -2 99 30
## 54910 3301 Ward 5 NW 1999 -13 93 24
## 55049 3301 Ward 5 NW 1999 -4 90 36
## 81695 9400 Ward 5 NE 2000 30 71 53
## 84166 7901 Ward 6 NE 1998 -1 89 37
## 90982 7703 Ward 7 SE 2017 15 74 53
## 90983 7703 Ward 7 SE 2017 15 74 53
## 100486 7504 Ward 8 SE 1999 -6 62 45
## 102783 7503 Ward 8 SE 1998 -5 93 44
## 106684 10900 Ward 8 SW 2000 1 47 29
## PRICE_GBA
## 3331 0.43697
## 4200 0.96699
## 18569 0.00440
## 28637 3697.18310
## 42070 0.07864
## 48415 0.10960
## 49322 0.08361
## 54230 0.20653
## 54910 0.09630
## 55049 0.17100
## 81695 0.00070
## 84166 0.19260
## 90982 3197.67442
## 90983 3235.29412
## 100486 0.36550
## 102783 0.22085
## 106684 0.00047
We can compare the differences of the distribution between the within outliers and without outliers.
Within outliers:
boxplot(prerp$PRICE, ylab='PRICE')
boxplot(prerp$PRICE_GBA, ylab='PRICE_GBA')
Without outliers:
boxplot(rp$PRICE,ylab='PRICE')
boxplot(rp$PRICE_GBA,ylab='PRICE_GBA')
From the plot above, although the plot is more reasonable after the first outliers cleaning, we still want to make the distribution looks like normal distribution. At this step, we call the outlierKD function and it removes the outliers with the upperbound over 1,600,000 USD. Also, it is better to do statistical inference with the normal distributed data.
outlierKD(rp, PRICE)
## Outliers identified: 1633 nPropotion (%) of outliers: 5.5 nMean of the outliers: 2461816 nMean without removing outliers: 707479 nMean if we remove outliers: 611350 nDo you want to remove outliers and to replace with NA? [yes/no]:
## Nothing changed n
rp <- na.omit(rp)
Now we try to draw a (Pearson) correlation matrix among these variables. Before that, we assume these variables are numeric.
It is obvious to see that price is highly positive related to the gross building area and some other attributes about rooms, which is consistent with our prior knowledge.
rp_corr <- cor(rp_num)
corrplot(rp_corr, method = "square")
After clearning all the data, we take a view at the basic statistics about this new dataset we will use in the report. In this dataset, the mean price is 611,350 USD, which is closed to value calculated by other authorities. The mean area is 1668 square feet. Besides, the median price and area is a little bit lower than the mean values. As for the deviation, it is quite large that the standard deviations are approximately half of the mean values both for price and area.
stat_rp <- stat.desc(rp)
show_stat <- stat_rp[sapply(stat_rp, is.numeric)]
show_stat
## BATHRM NUM_UNITS ROOMS BEDRM STORIES
## nbr.val 31435.0000 31435.0000 31435.000 31435.0000 31435.000
## nbr.null 1.0000 24.0000 0.000 4.0000 6.000
## nbr.na 0.0000 0.0000 0.000 0.0000 0.000
## min 0.0000 0.0000 3.000 0.0000 0.000
## max 11.0000 5.0000 20.000 15.0000 826.000
## range 11.0000 5.0000 17.000 15.0000 826.000
## sum 85760.0000 38175.0000 236437.000 110517.0000 67683.800
## median 2.5000 1.0000 7.000 3.0000 2.000
## mean 2.7282 1.2144 7.521 3.5157 2.153
## SE.mean 0.0063 0.0032 0.012 0.0064 0.029
## CI.mean.0.95 0.0123 0.0062 0.024 0.0126 0.057
## var 1.2388 0.3136 4.861 1.2956 26.135
## std.dev 1.1130 0.5600 2.205 1.1383 5.112
## coef.var 0.4080 0.4611 0.293 0.3238 2.374
## PRICE GBA KITCHENS FIREPLACES LANDAREA
## nbr.val 31435.00 31435.00 31435.0000 31435.0000 31435
## nbr.null 0.00 0.00 15.0000 15071.0000 0
## nbr.na 0.00 0.00 0.0000 0.0000 0
## min 45000.00 407.00 0.0000 0.0000 216
## max 11984000.00 11810.00 4.0000 13.0000 102340
## range 11939000.00 11403.00 4.0000 13.0000 102124
## sum 22239609301.00 55733190.00 39559.0000 24555.0000 99000459
## median 594900.00 1554.00 1.0000 1.0000 2066
## mean 707479.22 1772.97 1.2584 0.7811 3149
## SE.mean 3275.94 4.76 0.0033 0.0056 18
## CI.mean.0.95 6420.96 9.33 0.0064 0.0109 35
## var 337352735177.53 711879.37 0.3379 0.9774 9999155
## std.dev 580820.74 843.73 0.5813 0.9886 3162
## coef.var 0.82 0.48 0.4619 1.2656 1
## ZIPCODE CENSUS_TRACT SALE_YR SPAN_YR_RMDL
## nbr.val 31435.00000 31435.00 31435.000 31435.000
## nbr.null 0.00000 0.00 0.000 7681.000
## nbr.na 0.00000 0.00 0.000 0.000
## min 20001.00000 100.00 1991.000 -26.000
## max 20052.00000 11100.00 2018.000 1995.000
## range 51.00000 11000.00 27.000 2021.000
## sum 629019390.00000 160893142.00 63198295.000 184358.000
## median 20010.00000 4802.00 2012.000 1.000
## mean 20010.16033 5118.28 2010.444 5.865
## SE.mean 0.04162 19.28 0.034 0.091
## CI.mean.0.95 0.08157 37.79 0.067 0.179
## var 54.44258 11686239.73 36.996 262.915
## std.dev 7.37852 3418.51 6.082 16.215
## coef.var 0.00037 0.67 0.003 2.765
## SPAN_AYB SPAN_EYB PRICE_GBA
## nbr.val 31435.00 31435.000 31435.0
## nbr.null 19.00 27.000 0.0
## nbr.na 0.00 0.000 0.0
## min -14.00 -16.000 88.9
## max 262.00 92.000 1445.1
## range 276.00 108.000 1356.2
## sum 2715567.00 1278360.000 12396662.8
## median 88.00 43.000 374.7
## mean 86.39 40.667 394.4
## SE.mean 0.13 0.064 1.1
## CI.mean.0.95 0.25 0.126 2.2
## var 518.61 129.247 38275.6
## std.dev 22.77 11.369 195.6
## coef.var 0.26 0.280 0.5
For the numeric dataset, we generate the best linear model for price with 12 maximal number of variables. We have used three information criteria but all of these criteria have the same result. Thus, we only show the plot of “Adjusted R^2”. Without considering some variables are factor, the most significant variables are gross building area (GBA), date of sale (YR_SALE), number of fireplaces (FIREPLACES), time span of building (SPAN_AYB), time span of an improvement (SPAN_EYB), number of units (NUM_UNITS), heating system (HEAT), qualification (QUALIFIED), and population of a particular district (CENSUS_TRACT). i.e. PRICE ~ BATHRM + NUM_UNITS + GBA + FIREPLACES + QUALIFIED + CENSUS_TRACT + SALE_YR + SPAN_AYB + SPAN_EYB.
rp_num_best <- regsubsets(PRICE~. -PRICE_GBA-ZIPCODE, data = rp_num, method = "seqrep", nvmax = 12)
plot(rp_num_best, scale = "adjr2", main = "Adjusted R^2")
# In the "leaps" package, we can use scale=c("bic","Cp","adjr2","r2")
#plot(rp_num_best, scale = "bic", main = "BIC")
#plot(rp_num_best, scale = "Cp", main = "Cp")
Similar for the best linear model for price per area, we obtain the model with different variables: PRICE_GBA ~ BEDRM (number of bedrooms) + QUALIFIED + GRADE (assessment by other authorities) + CNDTN (condition assessed by public) + FIREPLACES + WARD (eight different districts divided in DC) + SALE_YR + SPAN_AYB + SPAN_EYB. Acctually, we prefer the model w.r.t price per area and we will prove it in the analysis of the later SMART questions.
rp_num_best <- regsubsets(PRICE_GBA~.-PRICE-GBA-ZIPCODE, data = rp_num, method = "seqrep", nvmax = 12)
plot(rp_num_best, scale = "adjr2", main = "Adjusted R^2")
# In the "leaps" package, we can use scale=c("bic","Cp","adjr2","r2")
#plot(rp_num_best, scale = "bic", main = "BIC")
#plot(rp_num_best, scale = "Cp", main = "Cp")
Recall the linear regression model for price and for price per area. All the coefficents of variables are very significant and the regression model fits the data moderately well (with adj. R^2 = 0.693 and 0.567 respectively).
reg_rp_num <- lm(PRICE ~ BATHRM + NUM_UNITS + GBA + FIREPLACES + QUALIFIED + CENSUS_TRACT + SALE_YR + SPAN_AYB + SPAN_EYB, data = rp_num)
summary(reg_rp_num)
##
## Call:
## lm(formula = PRICE ~ BATHRM + NUM_UNITS + GBA + FIREPLACES +
## QUALIFIED + CENSUS_TRACT + SALE_YR + SPAN_AYB + SPAN_EYB,
## data = rp_num)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2549521 -156520 -6689 137777 7115773
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -70037411.126 699831.028 -100.1 <2e-16 ***
## BATHRM 31006.711 2484.506 12.5 <2e-16 ***
## NUM_UNITS -91183.218 3541.307 -25.8 <2e-16 ***
## GBA 334.488 3.417 97.9 <2e-16 ***
## FIREPLACES 98561.721 2281.686 43.2 <2e-16 ***
## QUALIFIED -107198.369 5270.501 -20.3 <2e-16 ***
## CENSUS_TRACT -16.833 0.607 -27.7 <2e-16 ***
## SALE_YR 35007.629 350.331 99.9 <2e-16 ***
## SPAN_AYB 4301.970 86.016 50.0 <2e-16 ***
## SPAN_EYB -10887.898 214.351 -50.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 322000 on 31425 degrees of freedom
## Multiple R-squared: 0.693, Adjusted R-squared: 0.693
## F-statistic: 7.87e+03 on 9 and 31425 DF, p-value: <2e-16
vif(reg_rp_num)
## BATHRM NUM_UNITS GBA FIREPLACES QUALIFIED
## 2.3 1.2 2.5 1.5 1.0
## CENSUS_TRACT SALE_YR SPAN_AYB SPAN_EYB
## 1.3 1.4 1.2 1.8
#rp_num$SALE_YR <- rp_num$SALE_YR - min(rp_num$SALE_YR) #find the reg. result is same
reg_rp_num_pg <- lm(PRICE_GBA ~ BEDRM + QUALIFIED + GRADE + CNDTN + FIREPLACES + WARD + SALE_YR + SPAN_AYB + SPAN_EYB, data = rp_num)
summary(reg_rp_num_pg)
##
## Call:
## lm(formula = PRICE_GBA ~ BEDRM + QUALIFIED + GRADE + CNDTN +
## FIREPLACES + WARD + SALE_YR + SPAN_AYB + SPAN_EYB, data = rp_num)
##
## Residuals:
## Min 1Q Median 3Q Max
## -499.6 -85.3 -11.2 70.7 1266.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -34904.4961 288.9141 -120.8 <2e-16 ***
## BEDRM -30.0676 0.6904 -43.5 <2e-16 ***
## QUALIFIED -72.0525 2.1160 -34.0 <2e-16 ***
## GRADE 5.7892 0.1933 29.9 <2e-16 ***
## CNDTN 10.2249 0.4062 25.2 <2e-16 ***
## FIREPLACES 27.9433 0.8573 32.6 <2e-16 ***
## WARD -22.5467 0.4324 -52.1 <2e-16 ***
## SALE_YR 17.6536 0.1449 121.8 <2e-16 ***
## SPAN_AYB 1.9984 0.0352 56.7 <2e-16 ***
## SPAN_EYB -4.0083 0.0881 -45.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 129 on 31425 degrees of freedom
## Multiple R-squared: 0.567, Adjusted R-squared: 0.567
## F-statistic: 4.57e+03 on 9 and 31425 DF, p-value: <2e-16
vif(reg_rp_num_pg)
## BEDRM QUALIFIED GRADE CNDTN FIREPLACES WARD
## 1.2 1.0 1.3 1.3 1.4 1.3
## SALE_YR SPAN_AYB SPAN_EYB
## 1.5 1.2 1.9
The previous model is still based on the assumption of numeric variables. We just trivially try the same model in the raw variables. Although it may not be the best model, this model still has some reference value for analyzing the best linear model.
It should be noted that the best linear model for price per area fits the data better while the model for price fits the data worse. It is one of the reasons that we prefer to use the measurement of price per area here.
According the VIF values, all of these models have very low VIF values, which means it is not significant these models have the problem of multicolinearity.
reg_rp <- lm(PRICE ~ NUM_UNITS + GBA + FIREPLACES + CENSUS_TRACT + SALE_YR + SPAN_AYB + SPAN_EYB, data = rp)
summary(reg_rp)
##
## Call:
## lm(formula = PRICE ~ NUM_UNITS + GBA + FIREPLACES + CENSUS_TRACT +
## SALE_YR + SPAN_AYB + SPAN_EYB, data = rp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2611842 -158447 -4069 142338 7044929
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -74460583.841 673492.954 -110.6 <2e-16 ***
## NUM_UNITS -82794.941 3477.628 -23.8 <2e-16 ***
## GBA 353.878 2.878 123.0 <2e-16 ***
## FIREPLACES 100988.815 2300.878 43.9 <2e-16 ***
## CENSUS_TRACT -18.208 0.608 -30.0 <2e-16 ***
## SALE_YR 37188.383 337.015 110.3 <2e-16 ***
## SPAN_AYB 4325.679 86.803 49.8 <2e-16 ***
## SPAN_EYB -11884.914 210.514 -56.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 325000 on 31427 degrees of freedom
## Multiple R-squared: 0.687, Adjusted R-squared: 0.687
## F-statistic: 9.85e+03 on 7 and 31427 DF, p-value: <2e-16
vif(reg_rp)
## NUM_UNITS GBA FIREPLACES CENSUS_TRACT SALE_YR
## 1.1 1.8 1.5 1.3 1.3
## SPAN_AYB SPAN_EYB
## 1.2 1.7
reg_rp_pg <- lm(PRICE_GBA ~ BEDRM + QUALIFIED + GRADE + CNDTN + FIREPLACES + WARD + SALE_YR + SPAN_AYB + SPAN_EYB, data = rp)
summary(reg_rp_pg)
##
## Call:
## lm(formula = PRICE_GBA ~ BEDRM + QUALIFIED + GRADE + CNDTN +
## FIREPLACES + WARD + SALE_YR + SPAN_AYB + SPAN_EYB, data = rp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -535.2 -69.3 -7.5 61.1 1248.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -34494.7427 248.1859 -138.99 < 2e-16 ***
## BEDRM -24.9726 0.6075 -41.11 < 2e-16 ***
## QUALIFIEDU -67.4641 1.7803 -37.89 < 2e-16 ***
## GRADEAverage -17.5680 1.7035 -10.31 < 2e-16 ***
## GRADEExcellent 69.4358 3.8965 17.82 < 2e-16 ***
## GRADEExceptional-A 134.3369 6.5534 20.50 < 2e-16 ***
## GRADEExceptional-B 229.0489 10.8798 21.05 < 2e-16 ***
## GRADEExceptional-C 388.1131 21.2781 18.24 < 2e-16 ***
## GRADEExceptional-D 378.2852 27.2779 13.87 < 2e-16 ***
## GRADEFair Quality -21.7109 18.9193 -1.15 0.251
## GRADEGood Quality 23.9241 1.8993 12.60 < 2e-16 ***
## GRADESuperior 119.0405 4.3625 27.29 < 2e-16 ***
## GRADEVery Good 39.9269 2.7157 14.70 < 2e-16 ***
## CNDTNExcellent 213.1744 12.3316 17.29 < 2e-16 ***
## CNDTNFair 11.5871 13.1724 0.88 0.379
## CNDTNGood 32.9560 1.5658 21.05 < 2e-16 ***
## CNDTNPoor 81.9009 44.1474 1.86 0.064 .
## CNDTNVery Good 104.5957 2.3335 44.82 < 2e-16 ***
## FIREPLACES 4.5339 0.7697 5.89 3.9e-09 ***
## WARDWard 2 141.4621 3.4090 41.50 < 2e-16 ***
## WARDWard 3 63.1821 2.9670 21.30 < 2e-16 ***
## WARDWard 4 -56.6996 2.6304 -21.56 < 2e-16 ***
## WARDWard 5 -82.8398 2.7241 -30.41 < 2e-16 ***
## WARDWard 6 20.0013 2.4500 8.16 3.4e-16 ***
## WARDWard 7 -155.1005 3.1477 -49.27 < 2e-16 ***
## WARDWard 8 -201.1174 3.7956 -52.99 < 2e-16 ***
## SALE_YR 17.3763 0.1245 139.60 < 2e-16 ***
## SPAN_AYB 0.6616 0.0367 18.04 < 2e-16 ***
## SPAN_EYB -0.8913 0.0850 -10.48 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 108 on 31406 degrees of freedom
## Multiple R-squared: 0.696, Adjusted R-squared: 0.695
## F-statistic: 2.56e+03 on 28 and 31406 DF, p-value: <2e-16
vif(reg_rp_pg)
## BEDRM QUALIFIEDU GRADEAverage
## 1.3 1.0 1.7
## GRADEExcellent GRADEExceptional-A GRADEExceptional-B
## 1.5 1.4 1.1
## GRADEExceptional-C GRADEExceptional-D GRADEFair Quality
## 1.0 1.1 1.0
## GRADEGood Quality GRADESuperior GRADEVery Good
## 1.7 1.8 1.7
## CNDTNExcellent CNDTNFair CNDTNGood
## 1.1 1.0 1.6
## CNDTNPoor CNDTNVery Good FIREPLACES
## 1.0 1.7 1.6
## WARDWard 2 WARDWard 3 WARDWard 4
## 2.4 3.0 2.6
## WARDWard 5 WARDWard 6 WARDWard 7
## 2.5 2.8 2.4
## WARDWard 8 SALE_YR SPAN_AYB
## 1.6 1.5 1.9
## SPAN_EYB
## 2.5
Our report has 4 aspects of SMART Questions: about floor plan, about location, about time related factors, about facilities.
Generally, the floor plan is the most important element to attract home buyers. Therefore, it must be tightly related to the price. In this part, we will research the correlation between price and variables that are related to the floor plan.
We filter three factors from cleaned data that are the number of bathrooms(BATHRM), the number of bedrooms(BEDRM) and the number of the kitchen(KITCHENS). According to the correlation plot, we will have an overview of the correlation between price and there three factors. Basically, it shows the bathroom tightly relates to price and the kitchen has the weakest correlation. We will figure out the reason in the next steps.
fp_corr <- cor(rp_num[c(1,6,19,8)])
corrplot(fp_corr, method = "square")
Since the number of kitchens looks weakly affected to price, we build two linear regression models to figure out the relationship between price and kitchen.
fpmd <- lm(PRICE ~ BATHRM + BEDRM + KITCHENS , data = rp_num)
summary(fpmd)
##
## Call:
## lm(formula = PRICE ~ BATHRM + BEDRM + KITCHENS, data = rp_num)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2676540 -262056 -48967 189851 10001563
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -107405 9143 -11.8 <2e-16 ***
## BATHRM 290460 3139 92.5 <2e-16 ***
## BEDRM 50725 3112 16.3 <2e-16 ***
## KITCHENS -123866 4855 -25.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 466000 on 31431 degrees of freedom
## Multiple R-squared: 0.356, Adjusted R-squared: 0.356
## F-statistic: 5.79e+03 on 3 and 31431 DF, p-value: <2e-16
vif(fpmd)
## BATHRM BEDRM KITCHENS
## 1.8 1.8 1.2
In this reference model, all of the coefficients are significant. Adjusted R-squared is 0.356 and vif test doesn’t show multicollinearity.
fpmdnokc <- lm(PRICE ~ BATHRM + BEDRM, data = rp_num)
summary(fpmdnokc)
##
## Call:
## lm(formula = PRICE ~ BATHRM + BEDRM, data = rp_num)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2452447 -255578 -47343 192628 9989209
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -181635 8757 -20.7 <2e-16 ***
## BATHRM 281157 3149 89.3 <2e-16 ***
## BEDRM 34721 3080 11.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 471000 on 31432 degrees of freedom
## Multiple R-squared: 0.343, Adjusted R-squared: 0.343
## F-statistic: 8.2e+03 on 2 and 31432 DF, p-value: <2e-16
vif(fpmdnokc)
## BATHRM BEDRM
## 1.7 1.7
boxplot(rp$PRICE~rp$BEDRM, col=topo.colors(20))
boxplot(rp$PRICE~rp$KITCHENS, col=topo.colors(6))
anova(fpmd, fpmdnokc)
## Analysis of Variance Table
##
## Model 1: PRICE ~ BATHRM + BEDRM + KITCHENS
## Model 2: PRICE ~ BATHRM + BEDRM
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 31431 6828125428264322
## 2 31432 6969530954644670 -1 -141405526380348 651 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
After removing KITCHENS, coefficients are still significant, vif test also doesn’t illustrate multicollinearity. However, the Adjusted R-square decreases and the ANOVA test shows the importance of kitchen to the model. These means KITCHENS can improve the model but is not strongly related to price. I guess the reason is that the scale of the number of kitchens is narrow, thus the effect of the kitchen hardly reflects to price.
In this part, we want to have some interaction between speaker and audiences in our presentation, so we would invite a volunteer to tell us how many bathrooms, bedrooms, and kitchens he/she want for his/her dream house in DC.
Price_filter <- rp_num[rp_num$BATHRM== 4 & rp_num$BEDRM==4 & rp_num$KITCHENS == 1,]
Based on the overview of properties in DC, we also want this volunteer to guess the price of his/her dream house. We use the t-test to show if he/she is a good guesser. For example, this audience set 300000 USD at an expected price. In our t-test, we assume the mean = 300000 as our null hypothesis. Then we could get the p-value to measure if his/her guessed number is close to the truth in 0.99 confidence level.
ttest99 = t.test(x=Price_filter$PRICE, mu=300000, conf.level=0.99 )
ttest99
##
## One Sample t-test
##
## data: Price_filter$PRICE
## t = 20, df = 300, p-value <2e-16
## alternative hypothesis: true mean is not equal to 300000
## 99 percent confidence interval:
## 940387 1118092
## sample estimates:
## mean of x
## 1029240
ttest99$conf.int
## [1] 940387 1118092
## attr(,"conf.level")
## [1] 0.99
ttest99$alternative
## [1] "two.sided"
ttest99$estimate
## mean of x
## 1029240
The box plot and summary indicate that even though people can purchase a house on the minimum price, they may not get a perfect one. Nobody wants to live in an extremely narrow house with a lot of rooms.Area is a key factor to measure a dream house(This is a bridge to transit to Area part).
boxplot(Price_filter$PRICE)
summary(Price_filter$PRICE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 212525 615000 920000 1029240 1320000 4300000
According to this scatter plot. Gross building area (GBA) is positively related to price. We color the scatter dots by ‘Landarea’. Since the color distributes irregularly. It reflects ‘Landarea’ may not affect the price. The sharp edges are due to data cleaning.
plot(rp_num$PRICE,rp_num$GBA,col=rp_num$LANDAREA, main = 'GBA-Price(Colored by Landarea)', xlab = 'Price', ylab = 'GBA')
plot(rp_num$PRICE,rp_num$LANDAREA, main = 'Landarea-Price', xlab = 'Price', ylab = 'Landarea')
We want to prove if land area is not related to price. Thus we first draw a scatter plot for land area and price. This plot doesn’t show any regular correlation between them. We also draw Q-Q plots for the GBA and Land area. These plots illustrate that are not normally distributed.
qqnorm(rp_num$LANDAREA, main = "Landarea Q-Q Plot")
qqline(rp_num$LANDAREA)
qqnorm(rp_num$GBA, main = "GBA Q-Q Plot")
qqline(rp_num$GBA)
area_corr <- cor(rp_num[c('LANDAREA','GBA','PRICE')])
corrplot(area_corr, method = "square")
The correlation plot shows it is related to price, but we finally want to use linear regression models and ANOVA test to determine their relationship. According to the ANOVA test, it’s easy to see the land area doesn’t affect price because the p-value is pretty large when we add land area into the model. It means Model 1 and Model 2 are the same. Also, the R-Squares in lm_GBA and lm_GBA_LANDAREA are almost same but lmLANDAREA is pretty small, which support us to make a conclusion: the land area is not a key element to change the price.
lm_GBA_LANDAREA <- lm(formula = PRICE ~ LANDAREA + GBA , data = rp_num)
lm_GBA <- lm(formula = PRICE ~ GBA , data = rp_num)
lm_LANDAREA<- lm(formula = PRICE ~ LANDAREA, data = rp_num)
#summary(lm_GBA)
#summary(lm_GBA_LANDAREA)
#summary(lm_LANDAREA)
anova(lm_GBA, lm_GBA_LANDAREA, lm_LANDAREA)
## Analysis of Variance Table
##
## Model 1: PRICE ~ GBA
## Model 2: PRICE ~ LANDAREA + GBA
## Model 3: PRICE ~ LANDAREA
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 31433 5667532724398895
## 2 31432 5667527409628765 1 5314770130 0.03 0.86
## 3 31433 9147628459924014 -1 -3480101050295249 19300.57 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let’s study the affect of location on prices. DC has different wards. Let’s see if prices are equally distributed among all wards.
loadPkg("ggmap")
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
data <- read.csv("./data/DC_Properties.csv")
data <- data[!is.na(data$PRICE),] #removing rows with no price
# Location/Ward importance to the price?
wards <- vector()
q1_wards <- vector()
q3_wards <- vector()
max_wards <- vector()
avg_wards <- vector()
#names(wards) <- list(unique(data$WARD))
for (i in sort((unique(data$WARD)))){
max_wards[[i]] <- max(subset(data,WARD==i)$PRICE)
avg_wards[[i]] <- mean(subset(data,WARD==i)$PRICE)
wards[[i]] <- list(sort(list(subset(data,WARD==i)$PRICE)[[1]]))
q1_wards[[i]] <- quantile(wards[[i]][[1]], seq(from = 0, to = 1, by = 0.25))[[2]]
q3_wards[[i]] <- quantile(wards[[i]][[1]], seq(from = 0, to = 1, by = 0.25))[[4]]
}
barplot(avg_wards)
lines(q1_wards)
lines(q3_wards)
Ward location does affect house pricing. Houses in ward 3 are obviously more expensive than the rest. Let’s check if any other location based factors affect the prices too. Each house is on a different floor. We should see if floors affect the pricing just like wards.
stor <- vector()
avg_stor <- vector()
for (i in (unique(data$STYLE))){
avg_stor[[i]] <- mean(subset(data,STYLE==i)$PRICE)
stor[[i]] <- list(sort(list(subset(data,STYLE==i)$PRICE)[[1]]))
}
barplot(order(unlist(avg_stor)), horiz = FALSE, names.arg=sort(names(avg_stor)))
Relatively, there is no specific floor which gives the house a higher price like ward does. Let’s analyse each and every ward of DC. Diving the house prices of each ward into quantiles and then visualising will tell us if the distrubution is truly location based or some other factors are affecting the prices.
register_google(key = "AIzaSyAhGGV1_J9ipBAsF6vE7fg56zjDy_uaCvA") #to be kept private
#plot all wards to see ward location
qmap_name <- "Washington dc"
dc <-qmap(qmap_name, zoom=12,maptype = "terrain")
## Source : https://maps.googleapis.com/maps/api/staticmap?center=Washington%20dc&zoom=12&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Washington+dc&key=xxx
data2 <- data[!is.na(data$WARD),]
dc <- dc + geom_point(aes(x = LONGITUDE, y = LATITUDE, colour = WARD),data = data2) + ggtitle("DC WARDS")
print(dc)
## Warning: Removed 1666 rows containing missing values (geom_point).
wards <- vector()
for (i in sort((unique(data$WARD)))){
cat(i)
wards[[i]] <- list(sort(list(subset(data,WARD==i)$PRICE)[[1]]))
q1 <- quantile(wards[[i]][[1]], seq(from = 0, to = 1, by = 0.25))[[2]]
q2 <- quantile(wards[[i]][[1]], seq(from = 0, to = 1, by = 0.25))[[3]]
q3 <- quantile(wards[[i]][[1]], seq(from = 0, to = 1, by = 0.25))[[4]]
data2 <- subset(data,WARD==i)
data2$Quantile[data2$PRICE > q3] <- "4"
data2$Quantile[data2$PRICE < q3] <- "3"
data2$Quantile[data2$PRICE < q2] <- "2"
data2$Quantile[data2$PRICE < q1] <- "1"
data2 <- data2[!is.na(data2$Quantile),]
qmap_name <- paste(c(i ,", washington dc"), collapse = "")
dc <-qmap(qmap_name, zoom=15)
dc <- dc + geom_point(aes(x = LONGITUDE, y = LATITUDE, colour = Quantile),data = data2) + ggtitle(i)
#png(filename= paste(c(i,".png"), collapse = ""))
print(dc)
#dev.off()
}
## Ward 1
## Source : https://maps.googleapis.com/maps/api/staticmap?center=Ward%201,%20washington%20dc&zoom=15&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Ward+1,+washington+dc&key=xxx
## Warning: Removed 12446 rows containing missing values (geom_point).
## Ward 2
## Source : https://maps.googleapis.com/maps/api/staticmap?center=Ward%202,%20washington%20dc&zoom=15&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Ward+2,+washington+dc&key=xxx
## Warning: Removed 9683 rows containing missing values (geom_point).
## Ward 3
## Source : https://maps.googleapis.com/maps/api/staticmap?center=Ward%203,%20washington%20dc&zoom=15&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Ward+3,+washington+dc&key=xxx
## Warning: Removed 11645 rows containing missing values (geom_point).
## Ward 4
## Source : https://maps.googleapis.com/maps/api/staticmap?center=Ward%204,%20washington%20dc&zoom=15&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Ward+4,+washington+dc&key=xxx
## "Ward 4, washingto..." not uniquely geocoded, using "ward ct nw, washington, dc 20036, usa"
## Warning: Removed 11969 rows containing missing values (geom_point).
## Ward 5
## Source : https://maps.googleapis.com/maps/api/staticmap?center=Ward%205,%20washington%20dc&zoom=15&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Ward+5,+washington+dc&key=xxx
## Warning: Removed 9730 rows containing missing values (geom_point).
## Ward 6
## Source : https://maps.googleapis.com/maps/api/staticmap?center=Ward%206,%20washington%20dc&zoom=15&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Ward+6,+washington+dc&key=xxx
## Warning: Removed 13506 rows containing missing values (geom_point).
## Ward 7
## Source : https://maps.googleapis.com/maps/api/staticmap?center=Ward%207,%20washington%20dc&zoom=15&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Ward+7,+washington+dc&key=xxx
## "Ward 7, washingto..." not uniquely geocoded, using "4645 nannie helen burroughs ave ne, washington, dc 20019, usa"
## Warning: Removed 6438 rows containing missing values (geom_point).
## Ward 8
## Source : https://maps.googleapis.com/maps/api/staticmap?center=Ward%208,%20washington%20dc&zoom=15&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Ward+8,+washington+dc&key=xxx
## Warning: Removed 4740 rows containing missing values (geom_point).
## Ward 2: Some of the most expensive houses are locaated in this ward. Almost all houses have high range, which means average value of houses in general might be less due to some houses in this ward due to other factors but overall the price is high. ## Ward 3: Almost all houses are expensive. Any house in this ward will have a high price. This means prices here are affected more by the location than any other factor. ## Ward 5, 7 and 8: These areas in DC have equal distribution from affordable houses to expensive ones. In such cases, location plays minor role and other factors such as land area are more important.
The other factors mentioned could be time. Which is why time analysis is important. Sell date of a house coupled with inflation could tell us a lot more. Let’s analyse price on basis of when the house was sold.
get_year <- function(x){
return(as.numeric(substring(x,0,4)))
}
sold_price <- vector()
sold_num <- vector()
data2 <- data[!is.na(data$SALEDATE),]
data2$sell_year <- get_year(data2$SALEDATE)
data2 <- data2[!is.na(data2$sell_year),]
data2 <- subset(data2,sell_year>1990)
#sapply(data2$sell_year, numeric)
for (i in sort(unique(data2$sell_year))){
data3 <- subset(data2,sell_year==i)
sold_num[[as.character(i)]] <- nrow(data3)
sold_price[[as.character(i)]] <- mean(data3$PRICE)
}
plot(sold_num, type = "s", xaxt="n")
options(scipen=20)
axis(1,at=1:length(unique(data2$sell_year)),labels=names(sold_num))
(Further analysis of time/sell year in 3.3.2)
In this part, we explored the relationship between the internal facilities and the price, mainly focusing on two parts: the heating and cooling facilites. By the way, in our dataset, AC (the air conditioner) represents the cooling facilities and the heat represents the heating facilities.
First, since the variables of AC and HEAT are characters, we use ‘table()’ function to count the frequency of different types in the two variables. As shown below, HEAT has 14 levels indicating 14 types of heaters while AC has only 2 levels indicating whether propeties have or don’t have the air conditioners.
names(rp)
## [1] "BATHRM" "HEAT" "AC"
## [4] "NUM_UNITS" "ROOMS" "BEDRM"
## [7] "STORIES" "PRICE" "QUALIFIED"
## [10] "SALE_NUM" "GBA" "STYLE"
## [13] "STRUCT" "GRADE" "CNDTN"
## [16] "EXTWALL" "ROOF" "INTWALL"
## [19] "KITCHENS" "FIREPLACES" "LANDAREA"
## [22] "ZIPCODE" "ASSESSMENT_NBHD" "ASSESSMENT_SUBNBHD"
## [25] "CENSUS_TRACT" "WARD" "QUADRANT"
## [28] "SALE_YR" "SPAN_YR_RMDL" "SPAN_AYB"
## [31] "SPAN_EYB" "PRICE_GBA"
temp_price<- rp[c(2,3,8,12)]
temp_price_n<- rp_num[c(2,3,8,12)]
temp_price[is.null(temp_price)]<-"0"
str(temp_price)
## 'data.frame': 31435 obs. of 4 variables:
## $ HEAT : Factor w/ 14 levels "Air Exchng","Air-Oil",..: 13 8 8 8 13 13 13 8 6 6 ...
## $ AC : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
## $ PRICE: num 1095000 2100000 1602000 1050000 1430000 ...
## $ STYLE: Factor w/ 19 levels "","1 Story","1.5 Story Fin",..: 8 8 8 8 5 5 5 5 5 5 ...
table(temp_price$HEAT)
##
## Air Exchng Air-Oil Elec Base Brd Electric Rad Evp Cool
## 6 16 31 16 8
## Forced Air Gravity Furnac Hot Water Rad Ht Pump Ind Unit
## 13685 16 9529 554 4
## No Data Wall Furnace Warm Cool Water Base Brd
## 0 32 7479 59
table(temp_price$AC)
##
## N Y
## 4563 26872
length(table(temp_price$HEAT))
## [1] 14
length(table(temp_price$STYLE))
## [1] 19
We draw a correlation matrix of the sub-dataset to show the correlation between cooling facilities, heating facilites and the properties’price.It??s very clear the price of shows a relative high correlation with the heating and cooling facilities.
temp_price_corr <- cor(rp_num[c('HEAT','AC','PRICE')])
corrplot(temp_price_corr, method = "square")
First, from the aspect of heating facilites, we use bar chart to reflect the using frequency of different types of heaters. We can see forced air, hot water rad and warm cool are used most.
HEAT_freq<-as.data.frame(table(temp_price$HEAT))
names(HEAT_freq)<-c("Types","Frequence")
ggplot(data=HEAT_freq, mapping=aes(x=Types,y=Frequence))+
geom_bar(stat="identity",size=50,fill="#D55E00")+coord_flip()
When it comes to the relationship between heating facilities and price, we use the Analysis of Variance (ANOVA). It’s clear that the p-value is quite small, we can say the heating facilities do have effects on the price.
HEAT <- subset(rp, select=c('PRICE','HEAT'))
aov_p_h <- aov(PRICE~HEAT, data=HEAT)
aov_p_h
## Call:
## aov(formula = PRICE ~ HEAT, data = HEAT)
##
## Terms:
## HEAT Residuals
## Sum of Squares 545246686635119 10059099190935380
## Deg. of Freedom 12 31422
##
## Residual standard error: 565800
## Estimated effects may be unbalanced
summary(aov_p_h)
## Df Sum Sq Mean Sq F value
## HEAT 12 545246686635119 45437223886260 142
## Residuals 31422 10059099190935380 320129183086
## Pr(>F)
## HEAT <0.0000000000000002 ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since there are so many types of heaters, we may wonder whether differen types of heaters have different effects on the price.From boxplot, we can see the price of some types shows a larger or smaller scope such as Air Exchng and Evp Cool, but the price of the three types used most-forced air, hot water rad and warm cool- show the similar medium and scope, which indicate people may not really care about what types of heaters in your house as long as it is warm enough.
plot(PRICE~HEAT, data=HEAT,col=topo.colors(20))
As for cooling systems, we only have information of yes or no. As the bar chart shows, most residiential properties do have the cooling systems(air conditioners).
AC_freq<-as.data.frame(table(temp_price$AC))
names(AC_freq)<-c("YN","Frequence")
ggplot(data=AC_freq, mapping=aes(x=YN,y=Frequence))+
geom_bar(stat="identity",width=0.4,fill="#D55E00")
We use the boxplot to compare the price when properties with or without cooling systems(air conditioners). As shown below, we can see the price shows a whole ascending trend from no to yes, which can be concluded that the cooling systems(air conditioners) do have effects on the price of residitial properties.
plot(x=rp$AC,y=rp$PRICE,log="y")
To prove the relationship between price and cooling facilities, We also use the Analysis of Variance (ANOVA), the p-value is quite small, we can say the cooling facilities(air conditioners) do have significant effects on price as well.
AC = subset(rp, PRICE>0, select=c('PRICE','AC'))
aov_p_c = aov(PRICE~AC, data=AC)
aov_p_c
## Call:
## aov(formula = PRICE ~ AC, data = AC)
##
## Terms:
## AC Residuals
## Sum of Squares 296500103900770 10307845773669598
## Deg. of Freedom 1 31433
##
## Residual standard error: 572652
## Estimated effects may be unbalanced
summary(aov_p_c)
## Df Sum Sq Mean Sq F value
## AC 1 296500103900770 296500103900770 904
## Residuals 31433 10307845773669598 327930702563
## Pr(>F)
## AC <0.0000000000000002 ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Furthermore, we use T-test to prove it. We set the null hypothesis as whether properties have or have no air conditioners, the means of the price are equal.And we set the alternative hypothesis as when properties have or have no air conditioners , the means of the price equal are not equal.Besides, we construct t-intervals at 0.999 level. As the result shows below, the P-value is less than 0.001, so we reject that the price of properties with or without have the same mean values. So air conditioners do have effects on price.
AC_Y<-subset(rp_num,AC==2)
AC_N<-subset(rp_num,AC==1)
t.test(AC_Y$PRICE, mu = mean(AC_N$PRICE), conf.level = 0.99)
##
## One Sample t-test
##
## data: AC_Y$PRICE
## t = 70, df = 30000, p-value <0.0000000000000002
## alternative hypothesis: true mean is not equal to 471795
## 99 percent confidence interval:
## 737982 757017
## sample estimates:
## mean of x
## 747500
In conclusion, we realize that data cleaning is a very important step for data analysis. Before our SMART questions, we’ve spent a lot time in data preprocessing and cleaning. If we take a look at this clean-out data, they are very unhealthy and may lead our analysis to the wrong directions. For the conlusion of SMART Questions: we find gross building area is a significant factor to decide the price of an property; It is significant that certain areas (ward 2&3) have higher housing price in DC; growth rate of the housing price is twice than the inflation rate during the recent 25 years, nevertheless, it is still a steady rate; As for the facilities in the properties, both cooling and heating affect price, but types of heating do not matter to the price.